Welcome to the Data Science World

This course is divided into three Parts

  • Part 1: Getting Started with R for Data Science
    • Introduction to data science.
    • CRISP-DM
    • R programming language.
    • Data Scientist’s Tools (Git + Github + Rstudio).
    • Brief introduction to data visualization

Welcom to DS.. Continue

  • Part 2: Data Wrangling & Statistics
    • Reading data from different sources.
    • Handling inconsistent,missing data & outliers
    • Data normalization & transformation
    • Dimension reduction
    • Intro to statistics

Welcom to DS.. Continue

  • Part 3: Modeling and Evaluation
    • Supervised Learning
      • Linear regression
      • Logestic regression
      • Decision trees
      • K-NN classification
    • Unupervised Learning
      • K-Mean clustering
      • Hierarchical clustering
    • Association Rules

Data are everywhere

Why study data science?

Dell sales reps productivity

Problem Sales reps spend a lot of time doing online research and gathering information from disparate sources before they actually engage with a sales prospects
Data Used LinkedIn and other public sources available on the Web
Solution Predictive model developed by ‘Lattice Engines’ company
Benefits Productivity Increased by 100%

Read full story here

Massachusetts medicaid fraudulent claims

Problem There were many fraudulent medicaid claims that has been costing Massachusetts State millions of dollars
Data Used Massachusetts’ Medicaid Management Information System (MMIS) and analyzes public and other Commonwealth data stored in a data warehous
Solution Predictive model called ‘NetReveal’
Benefits Recovered 2 million dollars in first six months

Read full story here

IBM & epic predective model saves lives

Problem Carilion Clinic will have to spend years to go over their data to identify patients with risk of heart failure. By then it will be too late
Data Used IBM’s data warehouse, and Epic electronic health records
Solution Watson predictive model and natural language processing
Benefits Identified 8,500 patients who are at risk of congestive heart failure within one year

Read full story here

President obama’s campaign

Problem Obama Campaign Staff needed to focus on the people who are more likely to vote and make sure they vote
Data Used Web, social media and collected Campaign Offices data
Solution Complex micro-targeting model
Benefits Predicted that Obama would receive 56.4% of the vote; the Obama share of the actual vote was 56.6% in Ohaio. Eventually made Obama president.

Read full story here

What is Data Science?

Data science

is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. ~ Wikipedia

When do we need to use data science

We need to make sense of the world around us. Scientists in many fields have been developing mathimatical models to explain and predict many natural phenomenon for years. For example, physicists observe the motion of planets and try to make equations to explain it. They believe that events are indicative to the underlying truth.

Data scienintist do exactly the same. We collect data because we believe that there is an underlying pattern to discover. Data science uses computer algorithms to automte the process of modeling the pattern of phenomenon.

The way scientists see the world

There is an equation for every problem on this earth. This equation or process is $f : x y $. Because it is sometimes impossible to find this equation exactly we approximate it using data. In other words, instead of trying to find the equation we look at the data and try to extract an equation that best fit.

instead of finding the true \(f\) we try to find \(g\) where \(g \approx f\).

What are the types of data?

  • Structured data
    • Data that has pre-defined format. We mainly refer to the data that can be stored in tabular format.

What are the types of data?

  • Unstructured data
    • Those include everything else, from texts on websites and social media to uploaded videos and music.

Structured vs unstructured

Structured vs unstructured

Dealing with unstructured data is beyond the scope of this course. So we will focus on Structure data in this course.

Are all data equal?

The answer is : NO. There are different levels

Are all data equal?

Tasks of data scientist

Data Scienctists are invlolved in a mainly 5 tasks in their daily work. They try to

  • Describe the data.
  • Estimate or predict it.
  • Classify it.
  • Cluster it.
  • Idenfity Association within it.

We will explore those 5 tasks in the next slides in details.

Tasks of data scientist

  • Description
    • This involved conducting Exploratory Analysis & Descriptive statistics.
    • The data scientist is only summrizing the data. No interpertatins is done at this task level.

Example of description

Here is a students data that includes GPA and GRE scores, rank of the school they applied to and whether they got admited or not.

##  admit          gre             gpa        rank   
##  YES:127   Min.   :220.0   Min.   :2.260   1: 61  
##  NO :273   1st Qu.:520.0   1st Qu.:3.130   2:151  
##            Median :580.0   Median :3.395   3:121  
##            Mean   :587.7   Mean   :3.390   4: 67  
##            3rd Qu.:660.0   3rd Qu.:3.670          
##            Max.   :800.0   Max.   :4.000

Tasks of data scientist

  • Estimation & prediction
    • We use categorical or/and numerical predictors to predict/estimate numerical target variable
    • In this task we try to uncover trends and understand the relationship between the variables.
    • If we use this knowledge and extend it to new data sets, then we are trying to predict values.

Estimation & prediction

In this standard linear regression model. We visualize the relationship between GPA & GRE. We also can exrapolate the model and make predictions.

  • Classification
    • We use categorical or/and numerical predictors to predict/estimate categorical target variable

  • Clustering
    • Grouping of records, observations, or cases into classes of similar objects. No need for target variable

  • Association.
    • Finding which attributes “go together.” Most prevalent in the business world, where it is known as market basket analysis

CRISP-DM

CRISP-DM : Cross-industry standard process for data mining

CRISP-DM

  1. Business/Research Understanding Phase
      • First, clearly define the project objectives and requirements in terms of the business or research unit as a whole.
        • Then, translate these goals and restrictions into the formulation of a data science problem definition.
        • Finally, prepare a preliminary strategy for achieving these objectives

Business/Research Understanding Phase

Business Understanding Example

  • An oil and gas company is seeking to reduce the operation fatalities.
    • Classification Problem.
  • A retail store want to identify their costumers market segments.
    • Clustering Problem.

CRISP-DM

  1. Data Understanding Phase
      • First, collect the data
        • Then, use exploratory data analysis to familiarize yourself with the data, and discover initial insights.
        • Evaluate the quality of the data.
        • Finally, if desired, select interesting subsets that may contain actionable patterns.

Data Understanding Details.

In order to understand data science, it is important to understand the nature of databases, data collection and data organization.

We need to understand the differences between databases, data warehouses, and data sets.

What is common among them is that they all have rows and columns.

Data Understanding Details.

A database is an organized grouping of information within a specific structure.

The tables are related by the single column they have in common: Owner_ID. By relating tables to one another, we can reduce redundancy of data and improve database performance. The process of breaking tables apart and thereby reducing data redundancy is called normalization

Data Understanding Details.

Most relational databases which are designed to handle a high number of reads and writes (updates and retrievals of information) are referred to as OLTP (online transaction processing) systems

It is not useful to analyze that OLTP data directly. We need to query the tables to get meaningful data. Queries are usually written in a language called SQL (Structured Query Language; pronounced ‘ sequel’)

Querying OLTP is usually very intensive and time consuming on even the most robust computers. That’s why we use Data warehouse

Data Understanding - Practical Example

Northwind Database

Data Understanding Details.

A data warehouse is a type of large database that has been denormalized and archived. Denormalization is the process of intentionally combining some tables into a single table in spite of the fact that this may introduce duplicate data in some columns (or in other words, attributes). Systems that perform this process are called OLAP (online analytical processing).

Data Understanding Details.

Becauase we data where houses are huge, we usually don’t work on all of it at once. We pick a subset from this large table. Those subsets are called Data Sets.

a subset of data that is designed to a business function is called Data Mart.

Data Understanding Details - Summary

CRISP-DM

  1. Data Preparation Phase
      • This labor-intensive phase covers all aspects of preparing the final data set, which shall be used for subsequent phases, from the initial, raw, dirty data.
        • Select the cases and variables you want to analyze, and that are appropriate for your analysis.
        • Perform transformations on certain variables, if needed.
        • Clean the raw data so that it is ready for the modeling tools

CRISP-DM

  1. Modeling Phase
      • Select and apply appropriate modeling techniques.
        • Calibrate model settings to optimize results.
        • Often, several different techniques may be applied for the same data mining problem.
        • May require looping back to data preparation phase, in order to bring the form of the data into line with the specific requirements of a particular data mining technique.

CRISP-DM

  1. Evaluation Phase
      • The modeling phase has delivered one or more models. These models must be evaluated for quality and effectiveness, before we deploy them for use in the field.
        • Also, determine whether the model in fact achieves the objectives set for it in phase 1.
        • Establish whether some important facet of the business or research problem has not been sufficiently accounted for.
        • Finally, come to a decision regarding the use of the data mining results

CRISP-DM

  1. Deployment Phase
      • Model creation does not signify the completion of the project. Need to make use of created models..
        • Example of a simple deployment: Generate automated reports a report
        • Example of a more complex deployment: Implement a parallel data mining process in another department.
        • Finally, come to a decision regarding the use of the data mining results

CRISP-DM in the real world

Homework